ABSTRACT


Spatial visualization skills are essential and fundamental to 
studying STEM subjects, and sketching is an effective way 
to practice those skills. One significant challenge of support- 
ing practice using sketching questions is the vast number of 
possible mistakes, making it time-consuming for instructors 
to provide customized and actionable feedback to students. 
The same challenge persists for computer programs as well. 
This paper introduces a clustering model designed to catego- 
rize sketching answers based on the severity and character- 
istics of their mistakes. The model is designed to be used by 
a computer-based training platform to provide customized, 
actionable formative feedback to students in real time. The
promising results also suggest a new and comprehensive set 
of evaluation criteria to assess a student’s performance on 
sketching questions. As a broader contribution, our work 
is a proof-of-concept for a modeling approach to automat- 
ically evaluate and provide formative feedback on complex 
free-hand sketches using abstract features that may be gen- 
eralized to a variety of disciplines that involve the creation 
of technical drawings.


1. INTRODUCTION


Spatial visualization is the ability to represent and mentally 
manipulate two-dimensional and three-dimensional objects 
[11]. A body of research has shown that good spatial vi- 
sualization skills help students succeed in STEM education 
[39, 3, 13, 25, 27, 32, 41, 44]. Encouragingly, existing
research also demonstrates that spatial visualization skills
are malleable and can be trained and improved, for example
through workshops and seminars [42]. In recent years,
spatial visualization skills training has successfully
increased the retention rates of first-year STEM students,
especially for underrepresented groups such as female
students [39, 23].


Besides multiple-choice questions that are traditionally used 
in spatial visualization training, free-hand sketching on grid 
paper is an effective type of practice question [38]. Sketching 
questions can imitate the sketching tasks required in many 
engineering disciplines, which is particularly helpful since 
sketching is a fundamental skill for engineering designs [22]. 
Work on the immediate-feedback assessment technique [26]
indicates that students gain from learning from their
mistakes rather than failing on the first try and giving up,
so students can benefit from a second chance on a practice
problem. However, providing formative feedback on free-hand
sketches without giving away the answer, an approach known
to support self-regulated learning [28], can be challenging
due to the wide variety of possible incorrect answers to
such activities.


While human instructors possess the capability to analyze 
an erroneous free-hand sketch, identify the source of po- 
tential errors and provide formative feedback, it is a time- 
consuming process and providing such feedback to a large 
student population would require prohibitive efforts that 
would likely prevent the feedback from being provided in 
a timely fashion [2]. Computer-based systems that can
provide timely formative feedback are an alternative that
addresses this limitation. However, one significant
challenge to automatically providing immediate customized
feedback for sketching questions is that the system must
recognize how much an answer differs from the answer key
and what types of mistakes students are making.


Sketching questions have an enormous number of possible
incorrect answers, which are often specific to a unique
problem, making it difficult, if not impossible, to identify
every possible error and to prepare unique feedback for each
one. As an alternative, a computer-based system could be
designed to recognize categories of answers based on the
severity or characteristics of their errors and provide
feedback relevant to each one. However, to the best of our
knowledge, there is no existing research that categorizes
answers to complex sketching questions based on their
errors, either conceptually or computationally. The lack of
a solution motivated us to identify patterns in students’
erroneous sketching answers and to create a computer-based
algorithm that can categorize them in real time.
Due to the lack of existing categories of erroneous answers
in free-hand sketching problems, we propose a clustering
approach to identify such categories. Our research questions
are the following:


RQ1 What categories exist in students’ sketching answers 
based on the severity and characteristics of their er- 
rors? 


RQ2 How meaningful are the identified erroneous answer 
categories, and what actionable feedback can be pro- 
vided for each category? 


We constructed a list of features that can be used to
characterize students’ erroneous sketching answers. Using a
k-means clustering approach, we discovered six common answer
categories for incorrect sketches that are distinct from one
another in the severity and characteristics of their
errors. Our clustering results suggest a new set of
evaluation criteria for complex free-hand sketching answers
that is more interpretable and generalizable than those in
prior work [7, 43, 5]. We also provide initial suggestions
for the kinds of formative feedback appropriate for each
answer category without giving away the answer [36].


To the best of our knowledge, our study is the first to identify 
categories of erroneous sketches, both computationally and 
conceptually, in spatial visualization sketching problems
using abstract features. Our approach also has the potential
to generalize to other subject areas that require sketching
practice, mostly technical drawings in Engineering and
Science subjects, such as circuit diagrams in Electrical
Engineering, engine models in Mechanical Engineering,
building plans in Architecture, and structural formulas in
Organic Chemistry.


2. RELATED WORK 
2.1 Spatial Visualization Skills and Sketching 


Spatial visualization skills were estimated to play an
important role in 84 careers [37], most of which are
STEM-related. A longitudinal study showed that
psychometrically assessed spatial ability predicts careers
in STEM fields after accounting for Math and Verbal
aptitudes [45].


Spatial visualization skills are applied in various STEM ar- 
eas. Research shows that students with better spatial vi- 
sualization skills perform better in Chemistry [32, 6]. In 
Organic Chemistry, for example, students with strong spa- 
tial visualization skills draw preliminary figures more often. 
Hence they use figures to gain a better understanding of 
the questions and are more likely to answer them correctly 
[32]. Another body of research revealed the connection be- 
tween spatial skills and Geoscience [17, 30]. In particular, 
students with strong visual penetration ability, e.g., imag- 
ining cross-sections, perform better in Geology [17]. Fur- 
thermore, understanding cross-sectioning is a basic skill in 
many other engineering subjects [9, 12]. Spatial
visualization has also been found to be closely related to
performance in Anatomy in Biology [34] and Radiology in
Medicine [16].


A wide variety of empirical research has shown that spatial
visualization skills are malleable. Interventions designed
to improve spatial visualization skills reach, on average, a
medium effect size of 0.47 [42]. A well-known training
program developed by Sorby (2009) showed significant
post-test improvement for each class of college students
over a six-year study. In particular, Sorby found that the
training significantly improved female students’ retention
rate but not that of male students [39]. This finding
suggests that spatial visualization skills training plays a
critical role in increasing the diversity of students in
STEM fields.


Sketching ability is fundamental to engineering design [22] 
and highly correlates with many STEM subjects [35]. To 
improve spatial visualization skills, sketching is one of the 
most effective approaches [38]. Electronic sketching has also 
demonstrated potential in training spatial visualization skills 
[8, 47]. Thus, sketching practice is worth studying as a
means of improving spatial visualization skills.


2.2 Computer-based Evaluation and Formative Feedback for Sketches


To the best of our knowledge, there is no prior work on
the evaluation of sketches in spatial visualization training,
either conceptually or computationally. The use of
computer-based formative feedback for spatial visualization
sketching has not been studied either. There is a body of
research on computer-based evaluation and formative feedback
for other types of sketches [5, 7, 43, 40, 15, 18, 19, 20].
However, some of these approaches are too simple or too
domain-specific to generalize to a case as complicated as
spatial visualization sketches. Others use evaluation
methods that cannot provide actionable or easy-to-interpret
formative feedback.


For free-hand sketching that is evaluated mostly on shape
and structure, there are a few existing evaluation
approaches in domains other than spatial visualization
training. Bhat (2017) developed Sketchography, a
river-sketching auto-grading tool for Geology [5]. This tool
performed sketch recognition and compared a student’s answer
with the answer key on three features: the similarity of the
river’s shape, computed with the Shape Context algorithm,
the distance between start points, and the distance between
endpoints. The tool then provided a score that was a
weighted sum of these three features. Sketchography
evaluated a river, which had only one line with the specific
features of a start point, an endpoint, and the shape of the
line. Because the application is so simple, the approach has
weak external validity and cannot be used to evaluate
spatial visualization sketches.


Chandan et al. (2018) [7], on the other hand, addressed the
more complicated case of free-hand drawings of objects from
specific categories, e.g., a bee or an airplane. They
applied a Convolutional Neural Network approach for object
categorization and a Scale Invariant Feature Transform
approach to check the similarity between a given sketch and
the “standard” sketch. As feedback, the tool showed the
percentage of similarity to various categories of objects.
The use of deep learning methods made the results
challenging to interpret. Hence, this approach is limited in
its capability to generate specific and actionable feedback
to help students improve their answers.


Mechanix, a sketch-based tutoring system for learning forces
applied on a truss, could provide specific feedback on
free-hand sketches of forces [43]. In this case, the errors
that could occur were known and clearly defined on a
per-arrow basis. Given the small number of arrows, it is
relatively easy to tailor specific and actionable feedback
to each error. A spatial visualization sketch, by contrast,
contains far more lines, making it infeasible to provide a
piece of feedback for each line.


Another body of work focused on the recognition of East
Asian characters, which are similar to a simple sketch [40].
However, these solutions applied an “all or nothing”
approach to recognizing the structure of a character, which
was not helpful in providing specific formative feedback. A
few other works aimed to evaluate and provide feedback on
the quality or aesthetics of a sketch, but not on its
correctness in terms of the structure of the shape [15, 18].
There is also an evaluation approach specifically for
computer-aided design solid models that uses criteria
related to parameters set in the computer-aided model. It
does not apply to free-hand sketching, where the concept of
parameters is not intuitive [19, 20].


Overall, there is limited work on computer-based evaluation
of complex free-hand sketching based on structural
correctness that can generate specific and actionable
formative feedback. Our work aims to fill this gap.


2.3 Answer Categorization in Content-based Automated Evaluation


In the automatic evaluation of constructed responses from a
content-based perspective, there is a rich body of work on
evaluating short answer questions for a variety of subjects
and domains [24]. However, except for the studies mentioned
in the last section, there is very little existing
literature on the content-based evaluation of complex
free-hand sketching. Therefore, we draw our inspiration from
existing research on evaluating short answer questions and
apply it to complex free-hand sketches, a different type of
constructed response.


Answer categorization is one of the most frequently used 
approaches to perform a content-based evaluation of short 
answers. In most cases, supervised learning is applied using 
a manually labeled training set based on pre-defined rubrics 
[21, 33, 1, 10, 29]. For example, c-rater applied NLP
techniques that determined whether an answer contained each
key concept. It was widely applied to short answer questions
in Biology, Psychology, Math, and Reading, not only to grade
answers but also to provide specific real-time feedback [21,
1]. Pulman and Sukkarieh (2005) experimented with Inductive
Logic Programming, Decision Tree, and Naive Bayes methods to
classify short answers into the desired category for Biology
[33].


In our case, however, there are neither pre-existing robust
rubrics to serve as the evaluation standard for spatial
visualization sketches nor known categories of error. This
made it difficult to label a training set manually and
accurately. Also, most content-based evaluation approaches
provided only up to three levels of scoring. The exceptions
that provided more than three levels either were unclear
about the definition of the levels or defined the levels
only as mechanical compositions of the correct answer [24].
As an alternative, we turned to unsupervised learning to
perform answer categorization and identify categories that
were as granular yet meaningful as possible. Clustering is
an often-used unsupervised learning approach in short-answer
grading, especially for open-ended questions. Previous work
[4, 48] has shown that clustering could group answers that
are similar in text characteristics, semantics, and topics.
Our work aims to leverage this method to categorize complex
sketches in spatial visualization training.


Figure 1: Free-hand sketching tool for isometric
sketching on the online spatial visualization training
platform


3. METHODS 
3.1 Data Collection 


We collected data from students solving free-hand sketching
problems in a 100-level engineering course called “Spatial
Visualization” that used an online training platform over
half a semester in Fall 2019 at our home institution, a
large public university in the Midwestern United States.
The platform was previously developed as a computer-based
spatial visualization training platform [47] to enable
practice at scale using online exercises and automatic
grading. Previous work has shown a significant improvement
in spatial visualization skills for those who completed the
exercises on the platform [47].


Students in the course met in person once a week for an
hour, and most of the coursework consisted of working
through practice problems on the platform on their own as
their weekly assignment, following the given instructions.
The focus of the practice questions differed each week,
depending on the particular set of skills being trained,
such as mental rotation, cross-sectioning, and coded plans.
The platform supports both multiple-choice questions and
sketching questions. Figure 1 and Figure 2 show the
free-hand sketching tool on the platform, which allows
students to sketch their answers on the computer. Students
can draw and erase lines on the grid paper freely. They can
also save their sketch when they leave the platform and load
it when they come back. In the course, students were given a
maximum of two attempts for each sketching question, i.e.,
they were given a second chance if they answered incorrectly
on the first attempt. All the sketching questions were
graded with an “all or nothing” approach.


The collected dataset includes 370 incorrect sketches from
14 students in the course, covering five types of sketching
questions and 61 unique questions. We excluded correct
sketches from the categorization because they would
naturally be in one category by mapping exactly to the
answer key. Examples of the types of sketching questions
include drawing the orthographic view of a 3D object given
the isometric view or vice versa, and drawing the 3D object
that results from rotating a given 3D object by a certain
angle in a given direction. Each type of sketching question
contained a series of different questions with 3D objects of
various shapes. Each sketch typically contains approximately
30 to 80 lines of unit length.


Figure 2: Free-hand sketching tool for orthographic
sketching on the online spatial visualization training
platform


Each submitted attempt to answer a question produced a raw
log describing the student’s answer. The raw log recorded
two major types of information. First, it contained the set
of lines in the final submitted sketch. Second, it recorded
the history of all the timestamped steps a student took in
adding or deleting a line, clearing the sketch, or loading
the sketch for that question (Figure 3). In this paper, we
focused on the final submitted sketch only, since the goal
is to categorize the final answer rather than to analyze
students’ process of solving a free-hand sketching problem.


Each final submitted sketch is represented by a list of
lines given by their X-Y coordinates. Each line is further
labeled with its type, either solid or dashed, which are the
two standard line types used in the sketching exercises for
different purposes. A sketch is mostly made up of solid
lines, but a dashed line should be used instead of a solid
line to represent an edge that is hidden from a particular
perspective.


Another data point in the raw log is the type of grid paper 
used for a sketch. There are two types of grid paper in the 
sketching exercises: an isometric grid for isometric drawing, 
and a dot grid for orthographic drawing. A sketch is
considered correct only if the shape and the size of the
object match those of the answer key and the sketch uses
the correct type of grid paper. The position of the sketch
on the grid paper is flexible.


We performed two steps of data standardization on the raw 
log before feature extraction. First, we aligned both the stu- 
dent’s answer and the answer key to the lower-left corner of 
the sketch-pad. Second, all the lines were broken down into 
unit length and de-duplicated so that lines that overlapped 
with each other would only be counted once. We conducted
these two steps to make it easier to compare a student’s
answer against the answer key.
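The two standardization steps can be sketched in Python as follows. This is a minimal illustration rather than the platform's actual code. It assumes that each line is given as two integer grid coordinates plus a type string, that y grows upward so that the lower-left corner is the minimum x and y, and that a later line overwrites the type of an earlier overlapping one. The function names are hypothetical.

```python
from math import gcd

def to_unit_segments(lines):
    """Split each line into unit segments; overlapping segments collapse to one key."""
    segments = {}
    for (x1, y1), (x2, y2), line_type in lines:
        steps = gcd(abs(x2 - x1), abs(y2 - y1))
        if steps == 0:  # zero-length line, nothing to record
            continue
        dx, dy = (x2 - x1) // steps, (y2 - y1) // steps
        for i in range(steps):
            a = (x1 + i * dx, y1 + i * dy)
            b = (x1 + (i + 1) * dx, y1 + (i + 1) * dy)
            # Sort the endpoints so a segment drawn in either direction has one key.
            segments[tuple(sorted((a, b)))] = line_type
    return segments

def align_lower_left(segments):
    """Translate the sketch so that its bounding box starts at (0, 0)."""
    if not segments:
        return {}
    min_x = min(x for seg in segments for x, _ in seg)
    min_y = min(y for seg in segments for _, y in seg)
    return {
        tuple((x - min_x, y - min_y) for x, y in seg): line_type
        for seg, line_type in segments.items()
    }

def standardize(lines):
    """Apply both steps: unit-length de-duplication, then alignment."""
    return align_lower_left(to_unit_segments(lines))
```

Applying the same function to both the student's answer and the answer key yields two dictionaries whose keys can be compared directly, regardless of where on the grid the student drew.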


3.2 Feature Extraction 


Figure 3: An example of a raw log file generated 
from sketching questions on the online spatial visu- 
alization training platform 


We developed a total of 8 features to use as input for our 
clustering model. We performed feature engineering man- 
ually after observing a small subset of the data to get an 
idea of what information human instructors might use when 
interpreting incorrect answers. In order to get a preliminary 
view of possible errors that would be as comprehensive as 
possible, we selected three questions that had the highest 
number of incorrect answers and observed the errors made 
by students on those problems. Based on our preliminary 
observation, we created three categories of features that rep- 
resent different characteristics of the observed errors. 


The first group of features uses a unit-length line as its basic 
unit, i.e., a line connecting adjacent points, and represents 
the number of lines that are wrong compared to the an- 
swer key. We observed from the subset of mistakes that the 
number of incorrect lines involved in a sketch varied widely, 
from only one wrong line to over 80% of lines being wrong. 
The number of incorrect lines is a straightforward way to 
quantify the degree to which a sketch was incorrect. We 
considered three scenarios in which a line is wrong. 


1. An extra line: a line is in the student’s answer, but 
there is no line at the same position in the answer key. 


2. A missing line: a line is in the answer key, but there 
is no line at the same position in the student’s answer. 


3. A line with incorrect type: two lines with the same 
position in the student’s answer and the answer key are 
of different types, i.e., solid line vs. dashed line. 


To normalize the number of incorrect lines against the com- 
plexity of the sketch, we adopted the percentage of wrong 
lines instead of the absolute number, i.e., dividing by the 
total number of lines in a sketch. The three features in this 
group are Percentage of Extra Lines, Percentage of Missing 
Lines, and Percentage of Lines with Correct Position but 
Incorrect Type. 
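As a rough illustration of this first feature group, the sketch below computes the three percentages from two standardized sketches. It assumes each sketch is a dictionary mapping a unit segment to its line type, and it normalizes by the number of lines in the answer key; the text only says "the total number of lines in a sketch", so that choice of denominator is an assumption, as is the function name.

```python
def line_features(student, key):
    """Return the three line-level features as fractions of the answer key's lines.

    student, key: dict mapping a unit segment to 'solid' or 'dashed'.
    """
    extra = [seg for seg in student if seg not in key]
    missing = [seg for seg in key if seg not in student]
    wrong_type = [seg for seg in student if seg in key and student[seg] != key[seg]]
    total = len(key)  # assumption: normalize by the answer key's line count
    if total == 0:
        raise ValueError("answer key has no lines")
    return {
        "perc_extra": len(extra) / total,
        "perc_missing": len(missing) / total,
        "perc_wrong_type": len(wrong_type) / total,
    }
```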


The second category of features represents the groupings of
the incorrect lines based on their location in a sketch. In
our preliminary observation, we found that, between two
sketches with a similar number of incorrect lines, the
incorrect lines may be interconnected and concentrated in
one place in one sketch while being scattered across
multiple spots in the other. These two cases represent
mistakes of different natures.


Figure 4: An example of a sketch (on the right) with
four error components, i.e., four sites of mistakes.
The sketch on the left is the answer key.


Based on the assumption that incorrect lines that are
connected are more likely caused by the same mistake, we
treated all the incorrect lines as an undirected graph and
defined each component in the graph as one “site” of
mistake. A component here has the same definition as a
component in an undirected graph: a subgraph in which any
two vertices are connected by paths and which is connected
to no additional vertices in the supergraph [46]. As an
example, Figure 4 shows a total of four error components in
the sketch: three extra lines in different locations and a
disconnected taller stack separated from the bottom of the
object.


We constructed three features in this category. The first 
feature is the number of components in the graph made of 
incorrect lines, which is a representation of the number of 
mistake sites in a sketch. Since the size of a component 
represents how severe a mistake is, the second feature is the 
average size of all the error components in a sketch. The 
larger the average component size is, the more severe the 
mistakes are on average. The last feature is the maximum 
size difference among all error components, which reflects 
the range of severity across multiple mistake sites in a sketch. 
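The three component features can be illustrated with a small union-find sketch. It assumes that the incorrect lines are unit segments given as pairs of endpoints, that two incorrect lines belong to the same component when they share an endpoint, and that the size of a component is its number of lines. The function name and the returned keys are hypothetical.

```python
from collections import Counter

def component_features(incorrect_segments):
    """Count error components and summarize their sizes (measured in lines)."""
    parent = {}

    def find(p):
        parent.setdefault(p, p)
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving keeps the trees shallow
            p = parent[p]
        return p

    # Each incorrect line is an edge that merges the sets of its two endpoints.
    for a, b in incorrect_segments:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Count how many lines fall into each component.
    sizes = list(Counter(find(a) for a, _ in incorrect_segments).values())
    if not sizes:
        return {"n_components": 0, "avg_size": 0.0, "max_size_diff": 0}
    return {
        "n_components": len(sizes),
        "avg_size": sum(sizes) / len(sizes),
        "max_size_diff": max(sizes) - min(sizes),
    }
```

For a sketch with two connected wrong lines in one corner and a single stray line elsewhere, this returns two components with an average size of 1.5 and a maximum size difference of 1.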


The last set of features describes the general characteristics 
of the sketch. One feature is whether the student uses the 
same type of sketching grid as the answer key. Another 
feature is whether the sketch is empty. An empty sketch
indicates that the student either did not attempt the
question or accidentally skipped it.


3.3 Model Construction 


Figure 5: Examples of mistakes in Cluster 0, having
one minor mistake. The sketch on the left is the
answer key.


As there was no prior framework or knowledge on how to
categorize the erroneous sketches, it was not possible to
obtain labels (ground truth) describing each answer. We
therefore used an unsupervised clustering algorithm to
identify categories of erroneous answers from existing data.
Based on prior observation of the data, we hypothesized that
each cluster would have a sphere-like shape in the feature
space. Therefore, we used k-means clustering with squared
Euclidean distance. The algorithm aims to assign all the
data points to a specified number of clusters such that
every data point is in the cluster with the nearest mean.
Ideally, data points that have similar values across all the
features are grouped in one cluster.
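The assign-to-nearest-mean idea can be illustrated with a small pure-Python version of Lloyd's algorithm using squared Euclidean distance. This is a sketch for exposition only: the text does not say which implementation or initialization was used, so the deterministic farthest-first seeding and the function names here are assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Return (labels, centroids) after alternating assignment and mean updates."""
    # Farthest-first seeding (an assumption): start from the first point, then
    # repeatedly add the point farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:  # assignments are stable, so the means are too
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster has no members
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids
```

In this setting, each point would be the eight-feature vector of one incorrect sketch, and the returned label would be its answer category.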


After feature extraction, we performed further data
normalization as the first step of model construction.
Because the k-means clustering algorithm is sensitive to the
scale of the features, we normalized each of three features
(Number of Components, Average Size of Components, and
Maximum Difference between Size of All Components) into the
unit interval across all data, so that they were on the same
scale as the other features, which were either percentages
or booleans.
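One common way to map a feature into the unit interval is min-max scaling, sketched below. The text does not name the exact normalization, so treating it as min-max scaling is an assumption, and the function name is hypothetical.

```python
def min_max_scale(values):
    """Rescale one feature column to [0, 1] using its minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:  # a constant column carries no information; map it to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Example: component counts of 1, 2, and 5 across three sketches.
scaled = min_max_scale([1, 2, 5])  # [0.0, 0.25, 1.0]
```

Keeping the minimum and maximum of each column also allows centroid values to be transformed back to the original scale when the clusters are interpreted.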


We performed parameter tuning to decide on the optimal
number of clusters k. We started with two clusters and
repeatedly increased the number of clusters by one. We
evaluated the choice of k using two criteria. The main
criterion was how interpretable a new cluster was and
whether it could help us provide more specific and
actionable feedback. A complementary criterion was the
Silhouette score, which measures the quality of the
clusters based on their cohesion and separation (the
Silhouette score ranges from -1 to 1). We valued the
interpretability of a cluster over a higher Silhouette
score. Therefore, as long as the Silhouette score remained
at an acceptable level, we increased k until the
interpretation of the newly generated cluster did not make
sense or did not differ much from the existing clusters.
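For reference, the mean Silhouette score can be computed as in the sketch below. For each point, a is the mean distance to the other members of its own cluster (cohesion), b is the lowest mean distance to the members of any other cluster (separation), and the point's score is (b - a) / max(a, b). The use of plain Euclidean distance and the function name are assumptions, since the text does not state which distance was used for this score.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette over all points; singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention a singleton cluster scores 0
        # Cohesion: mean distance to the other members (dist(p, p) is 0).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # Separation: mean distance to the nearest other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

In a tuning loop, this score would be recomputed after clustering with each candidate k and read alongside the interpretation of the newly created cluster.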


4. RESULTS 


Our clustering approach identified a set of six clusters
corresponding to categories of erroneous answers in
free-hand sketching problems, as listed in Table 1. The six
clusters are ordered in the table by the severity of their
errors. The clustering model yields a Silhouette score of
0.6659, which is a reasonable value.


Cluster 0 is the most common cluster in the dataset. From
the centroid values, we can see that the sketches in this
cluster have only one mistake (Number of Components = 1)
with about two incorrect lines (Avg Component Size = 1.89).
The centroid values suggest that a large portion of the
erroneous sketches had only one minor mistake, most likely
due to drawing errors such as forgetting an edge at a corner
or drawing an extra edge on a plane (see examples in
Figure 5).


Cluster 1, the second-largest cluster in the dataset, differs
from Cluster 0 mainly in the number of mistakes in the
sketch. On average, there are 2.21 mistake components in
the sketch. The average component size of 3.04 lines
suggests that these are still minor mistakes with about
three incorrect lines each. It is reasonable to interpret
Cluster 1 as sketches that have several minor mistakes.
Examples of this category are shown in Figure 6. Even though
both Cluster 0 and Cluster 1 contain minor errors, they are
different enough because students in Cluster 0 make one
small mistake, likely due to carelessness. In contrast,
those in Cluster 1 may have misconceptions that are causing
a series of mistakes.


Table 1: Clustering results summary. The size, interpretation, and centroid of each cluster are shown in
the table. Centroid values are transformed back to their original scale where unit normalization was performed.
Values are color-coded with different shades of red, from low values to high values.

Cluster  Size  Interpretation                                               Perc Missing  Perc Extra
0        218   Have one minor mistake                                       2.39%         2.16%
1        65    Have more than one minor mistake                             4.13%         10.46%
2        30    Have both major and minor mistakes, mostly minor mistakes    20.61%        32.14%
3        15    Have both major and minor mistakes, mostly major mistakes    37.82%        22.46%
4              More than half of the sketch as a whole is completely wrong
5              Empty sketch


Figure 6: Examples of mistakes in Cluster 1, having
multiple minor mistakes. The sketches with a white
background are the answer keys.


Clusters 2 and 3 are quite different from Clusters 0 and 1.
Both have a much higher Percentage of Missing Lines and
Percentage of Extra Lines than Clusters 0 and 1, suggesting
more severe mistakes in the sketch. More severe errors are
more likely to be due to an incorrect structure at specific
parts of the sketch than to careless mistakes. These two
clusters both have a high number of components (3.70 and
2.80 for Clusters 2 and 3, respectively), suggesting a
series of mistakes across the sketch. Clusters 2 and 3
differ in two respects. First, Cluster 2’s average component
size is small (5.13), while Cluster 3’s is much bigger
(10.69). Second, Cluster 3 has a massive difference in size
across its components (15.73), while Cluster 2 has a medium
difference of 5.63. These differences suggest that within
the series of mistakes in a Cluster 2 sketch, most are
minor and only a small proportion are major, as shown in
Figure 7. On the other hand, a sketch in Cluster 3 has
mainly major mistakes and fewer minor mistakes, as shown in
Figure 8. The major mistakes in Cluster 3 are also more
severe than those in Cluster 2 on average.


Figure 7: Examples of mistakes in Cluster 2, having
multiple minor mistakes and a small number of major
mistakes. The sketches with a white background
are the answer keys.


Cluster 4 has 80% of the lines missing and 67% extra lines,
much higher than the previous clusters. Interestingly, most
of the sketches in this cluster have only one error
component (1.05 components on average), with an average
size of 45.35 lines. These features suggest that there is
one substantial mistake that spans over half of the sketch,
which is often due to either an utterly wrong structure or a
wrong orientation. For example, both examples in Figure 9
have the correct structure but wrong orientations.


Lastly, Cluster 5 contains empty answers, either due to the 
student not attempting a question or accidentally skipping 
it. Even though the cluster size is small, with only 3 data 
points due to the low number of empty answers, it is distinct 
enough from all the other clusters to be on its own. 


Overall, we considered the erroneous answer categories de- 
tected to be intuitive and well-defined. They are distinct in 
the severity and characteristics of the mistakes. Being able 

Figure 8: An example of a mistake in Cluster 3, having
multiple major and minor mistakes, but mainly
major mistakes. The sketch on the left is the answer
key.

Figure 9: Examples of mistakes in Cluster 4, having
one large cluster of mistakes. The sketch with a white
background is the answer key.


to automatically identify six categories of erroneous answers
demonstrates the potential advantage of an unsupervised
approach to answer categorization over a supervised
learning approach that tries to align the model’s capability
with human judgment of the answer categories, which could
often yield only up to three clearly defined categories [24].
Additionally, we did not observe any significant difference
in the frequency distribution of the error categories
across the five types of sketching questions in our dataset,
which suggests that the error categories may generalize to a
wider variety of questions.


5. DISCUSSION 


5.1 Evaluation Criteria for Sketching 

Due to the lack of prior work on erroneous answer categories
in complex free-hand sketching problems, there is currently
no established set of criteria for evaluating the degree of
correctness of a complicated sketching answer. In past
offerings of the spatial visualization training at our school,
an instructor either used an “all or nothing” evaluation
approach or applied a subjective standard on one or two
dimensions to judge a sketch, e.g., taking off 0.5 points for
each missing or extra line up to a maximum of 1 point, or
taking off 1.5 points whenever not all features of the top,
front, and right sides are correct. These evaluation schemes
are too coarse to reflect the degree of correctness of a sketch
accurately. The results of our clustering analysis are a
promising step towards a more comprehensive view of how
to evaluate a sketch on a scale with multiple levels.


Our model demonstrated that more than one dimension is
needed concurrently to provide a nuanced interpretation of
the state of a sketch. In our model, the percentages of missing
lines, extra lines, and lines with the wrong type, the number
of mistake sites, the average size of the mistakes, and how
different the various mistake sites are in a sketch are used
in combination with one another to determine the degree of
correctness and the type of errors. For example, the distinction
between Clusters 2 and 3 suggests that, for a similar
percentage of incorrect lines, the number of mistake components
and the average size of the components bring additional
insight into whether a sketch contains a large number
of minor mistakes or a small number of major mistakes. As
another example, even though Cluster 0 and Cluster 1 have
a similar average size of mistakes, the number of mistake
sites suggests that students in Cluster 1 may have a more
systematic misconception than those in Cluster 0, who likely
made a mistake due to carelessness.
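
The mistake-site features discussed above can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each incorrect line is stored as a pair of grid-point endpoints, that two incorrect lines belong to the same mistake component when they share an endpoint, and that the variation in component size is measured with a population standard deviation. None of these representational choices is taken from the model itself, and the function names are illustrative only.

```python
from collections import defaultdict
from statistics import mean, pstdev


def mistake_components(lines):
    """Group incorrect lines into components.

    Assumption: two lines are connected when they share an endpoint.
    Each line is a pair of (x, y) grid points.
    """
    # Index the lines by the endpoints they touch.
    by_point = defaultdict(list)
    for i, (a, b) in enumerate(lines):
        by_point[a].append(i)
        by_point[b].append(i)

    seen = set()
    components = []
    for start in range(len(lines)):
        if start in seen:
            continue
        # Depth-first search over lines that share endpoints.
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            i = stack.pop()
            component.append(i)
            for point in lines[i]:
                for j in by_point[point]:
                    if j not in seen:
                        seen.add(j)
                        stack.append(j)
        components.append(component)
    return components


def component_features(lines):
    """Return the number, average size, and size spread of mistake components."""
    sizes = [len(c) for c in mistake_components(lines)]
    if not sizes:
        return {"n_components": 0, "avg_size": 0.0, "size_spread": 0.0}
    return {
        "n_components": len(sizes),
        "avg_size": mean(sizes),
        "size_spread": pstdev(sizes),  # hypothetical choice of spread metric
    }


# Two connected wrong lines plus one isolated wrong line: two mistake sites.
wrong_lines = [((0, 0), (1, 0)), ((1, 0), (1, 1)), ((5, 5), (6, 5))]
features = component_features(wrong_lines)
print(features)  # {'n_components': 2, 'avg_size': 1.5, 'size_spread': 0.5}
```

Under these assumptions, a sketch with many small components and a sketch with one large component can have a similar share of incorrect lines while producing very different feature values.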


Our approach could also be used to define minor mistakes 
versus major mistakes in a sketch for a group of sketch- 
ing questions with similar size and complexity. Without a 
systematic review of all the mistakes in a group of sketch- 
ing questions, it is hard for an instructor to draw an objec- 
tive line between an error that is significant and one that is 
not. As a result, the evaluation criteria may be overly strict 
or overly generous. The clustering model computationally 
categorizes what it considers as minor and major mistakes 
based on the optimal separation principle. Its outcome can 
serve as analytical support for an instructor’s grading deci- 
sion. 


5.2 Potential Intervention 

Since one of the motivations for constructing this model is to
provide real-time, customized, and actionable formative feedback,
we propose potential customized intervention messages
for each erroneous answer category. Based on the best
practices of offering formative feedback [36], each of the messages
follows a similar structure of (1) first letting the student
know how far they are from the correct answer, (2) describing
what types of mistakes there are, and (3) suggesting ways
for the student to approach solving the errors. A summary
of the interventions is provided in Table 2.


Students whose answers fall into Cluster 0 or Cluster
1, which consist of one or more minor errors, understand
what the object should look like structure-wise. When
the system tells them that they are wrong, they may find
it confusing since they are likely confident in their answer.
Hence, the feedback message could first assure the students
that they have got the general structure of the object correct.
Then, the system could let the students know that
they have X minor mistakes, where X is the feature
Number of Components. The feedback may also include
whether they have some missing lines, extra lines, or lines of
the wrong type. Lastly, the feedback message would suggest
that the students check the details of their drawing by listing
the common reasons for such errors, such as extra edges on
a flat plane or missing edges at a corner.


If the answer falls within Cluster 2 or Cluster 3, the feedback 
message should be different from that for Cluster 0 and 1 
because there is at least one major mistake in the answer, 

Cluster ID   Cluster Size   Interpretation
0            218            Have one minor mistake
1            65             Have more than one minor mistake
2            30             Have some major and minor mistakes, mostly minor mistakes
3            15             Have some major and minor mistakes, mostly major mistakes
4            39             More than half of the sketch as a whole is completely wrong
5            3              Empty sketch

Potential Intervention

Clusters 0 and 1:
• Encourage students that they got the general structure correct
• Inform students of the number of minor mistake sites they have
• Suggest that students check for detail errors and list the common reasons for
  such errors, e.g. extra edges on a flat plane, missing edges at a corner

Clusters 2 and 3:
• Encourage students that they are heading in the right direction
• Inform students of the number of minor and major mistake sites they have
• Suggest that students revisit some parts of the structure
• Suggest that students carefully check for drawing errors and list the common
  reasons for such errors, e.g. extra edges on a flat plane, missing edges at a
  corner

Cluster 4:
If students have the correct structure but a wrong orientation:
• Encourage students that they got the general structure correct
• Inform them that they may have drawn it in an incorrect orientation
If students have an incorrect structure:
• Let students know that they have the wrong idea for the structure
• Suggest that students rethink the structure from the beginning
• Provide hints for the students if available

Cluster 5:
• If students did not make an effort, encourage them to attempt the question
• If students forgot to submit a sketch, remind them to submit in the next attempt

Table 2: Interventions Summary Table


likely due to a structural error. The students in these two 
cases are mostly on the right track in terms of the general 
structure of the sketch. Hence, the feedback message could 
first encourage them that they are heading in the right direc- 
tion. The system could then say that the sketch has X minor 
mistakes and Y major mistakes, where X is the Number of 
Components with a size smaller than the Average Compo- 
nent Size of the cluster centroid, and Y is the Number of 
Components with a size larger than the average. Finally, 
the intervention message could suggest the student first re- 
visit the structure in detail to identify the major mistake, 
and then carefully check for drawing errors referring to a list 
of common minor mistakes. 
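
The counting rule for X and Y described above can be written as a short Python sketch. The function name and the message wording are hypothetical. A component whose size equals the centroid average is counted as neither minor nor major here, since the rule does not say how ties are handled.

```python
def count_minor_major(component_sizes, centroid_avg_size):
    """Split mistake components into minor (X) and major (Y) counts.

    component_sizes: number of lines in each mistake component of one sketch.
    centroid_avg_size: the Average Component Size of the cluster centroid.
    """
    minor = sum(1 for size in component_sizes if size < centroid_avg_size)
    major = sum(1 for size in component_sizes if size > centroid_avg_size)
    return minor, major


# Example sketch assigned to Cluster 2, whose average component size is 5.13.
x, y = count_minor_major([2, 3, 12, 1], centroid_avg_size=5.13)
print(f"The sketch has {x} minor and {y} major mistake sites.")  # x=3, y=1
```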


For a student whose answer falls into Cluster 4, it is likely that the
student is either on the wrong track entirely or has used a wrong
orientation. The system can perform a further check that compares
the student’s answer to other possible orientations to
determine whether the orientation is wrong.
If it is, the feedback message will remind the student that
the structure of the sketch is mostly correct, but the orientation
is incorrect. If it is not, the feedback message will tell the student
that they may have the wrong idea for the sketch and
should reconsider the question from the beginning. The system
could consider providing hints to the student as well
in this case.
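
One way to sketch the orientation check in Python is shown below. It assumes, hypothetically, that the correct object has already been rendered in each alternative orientation as a set of lines, that lines are pairs of grid-point endpoints, and that a mismatch ratio of at most 0.2 counts as a match. The threshold, the data layout, and the function names are illustrative assumptions rather than part of the proposed system.

```python
def normalize(line):
    """Order the two endpoints so that a line drawn in either direction compares equal."""
    a, b = line
    return (a, b) if a <= b else (b, a)


def mismatch_ratio(student_lines, key_lines):
    """Share of lines that appear in only one of the two sketches."""
    student = {normalize(line) for line in student_lines}
    key = {normalize(line) for line in key_lines}
    union = student | key
    if not union:
        return 0.0
    return len(student ^ key) / len(union)


def detect_wrong_orientation(student_lines, alternative_keys, threshold=0.2):
    """Return the name of the closest alternative orientation, or None.

    alternative_keys maps an orientation name to the answer key drawn in
    that (incorrect) orientation. Both are hypothetical inputs.
    """
    best_name, best_ratio = None, None
    for name, key_lines in alternative_keys.items():
        ratio = mismatch_ratio(student_lines, key_lines)
        if best_ratio is None or ratio < best_ratio:
            best_name, best_ratio = name, ratio
    if best_ratio is not None and best_ratio <= threshold:
        return best_name
    return None


student = [((0, 0), (0, 2)), ((0, 2), (1, 2))]
alternatives = {
    "rotated_90": [((0, 2), (0, 0)), ((0, 2), (1, 2))],
    "mirrored": [((0, 0), (2, 0))],
}
print(detect_wrong_orientation(student, alternatives))  # rotated_90
```

If no alternative orientation matches closely enough, the function returns None, which corresponds to the case where the student likely has the wrong idea for the structure.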


Lastly, if a student submits an empty sketch, the system can 
check the time spent on the question to determine whether 
the student did not attempt the question at all or forgot to 
click the submit button. If the student did not attempt the 
question, the system would encourage the student to make 
an effort in attempting to solve the problem. If the student 
forgot to submit the answer, the feedback message would
remind them to submit in the next attempt.
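
The time-based check for empty sketches could look like the following minimal Python sketch. It assumes that a very short time on the question indicates no attempt and a longer time indicates a forgotten submission. This mapping, the 30-second threshold, and the message wording are all hypothetical and would need to be calibrated on real usage data.

```python
def empty_sketch_feedback(seconds_on_question, min_attempt_seconds=30.0):
    """Choose a feedback message for an empty sketch based on time spent.

    Returns a (reason, message) pair. The threshold is a placeholder value.
    """
    if seconds_on_question < min_attempt_seconds:
        return (
            "not_attempted",
            "It looks like this question was skipped. Give it a try.",
        )
    return (
        "forgot_to_submit",
        "Your sketch was not submitted. Remember to submit on your next attempt.",
    )


reason, message = empty_sketch_feedback(seconds_on_question=4.0)
print(reason, "-", message)
```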


5.3 Generalizability of the Proof-of-concept Approach

Our clustering model is more than a single model that works
only in a specific scenario. It is a proof-of-concept approach
for the evaluation of a complex free-hand sketch based on abstract
features. Our contributions to the evaluation scheme
of sketching answers have the potential to be generalized
from spatial visualization training to other fields that involve
free-hand technical drawings in various Engineering and Science
subjects, such as circuit diagrams in Electrical Engineering,
engine models in Mechanical Engineering, building
plans in Architecture, and structural formulas in Organic
Chemistry. Technical drawing is similar to spatial visualization
sketching in the sense that both follow strict rules
of sketching and are often drawn on grid paper to ensure a
consistent proportion and orientation. Technical drawings
in these fields usually start from a fundamental practice of
drawing and modeling using practice problems that have a
limited number of correct answers. When answer keys are available,
our unsupervised clustering approach is flexible
and easy to retrain on new datasets to adapt to new
types of sketches, even with additional features developed
based on the learning goal of the type of sketches.


On the other hand, for technical drawing that involves a cre- 
ative component or pure creative drawing, it may be harder 
to apply our approach directly. In evaluating creative draw- 
ing that does not have a limited number of correct answers, a 
mistake may be more subjective, and the evaluation may ex- 
tend beyond getting a sketch correct to being functional, op- 
timal, creative or aesthetic. The clustering approach based 
on abstract features of a sketch, however, may be used for 
other purposes in this case. For example, our approach could
be used to group sketches with similar characteristics to- 
gether for the convenience of human graders, especially in a 
large course with limited human resources, such as Massive 
Open Online Courses. The features would need to be
re-engineered to serve these new goals.


6. LIMITATIONS AND FUTURE WORK 


The current erroneous answer categories do not take into 
account specific reasons that lead to a particular error in 
an answer. There may be multiple reasons for a student to 
end up with mistakes in the same category. To the best of 
our knowledge, there is neither prior work that studies the 
common misconceptions in spatial visualization sketching, 
nor cognitive models that describe the process of this task. 
The closest available work in cognitive models for spatial 
ability focuses on how people solve multiple choice spatial 
visualization questions, i.e., when candidate solutions are 
provided [14, 11, 31]. These models do not cover the process 
of generating a spatial object from scratch, which is what 
sets spatial visualization sketching apart from the traditional 
spatial ability tests. Hence, our proposed model is unable to 
distinguish the errors by their causes. Future research
conducting qualitative interviews with students to understand
the reasons why an error occurs could provide valuable
insights towards identifying not only broad categories of
erroneous answers, but also the causes behind various error
categories. It would also be beneficial to create cognitive models
to understand systematically the strategies students use to
solve these problems. This information would be valuable
in further developing other features that could distinguish
errors according to their underlying cause, for example, by 
leveraging the temporal sequence of actions executed by the 
student leading to their error. Improving current models 
to include information about the most probable cause of an 
error would be beneficial in generating formative feedback 
that goes beyond providing information about the nature of 
the students’ error, and integrates conceptual information 
to support students in addressing misconceptions. 


The current training data for the model only involved 14 
students, which is a relatively small sample. As such, the 
current model can be seen as a proof-of-concept for the feasi- 
bility of erroneous answer categorization. Applying the same 
approach to a larger population of students will be necessary 
to validate the stability of the model and ensure that there 
are no additional answer categories that may not have been 
included in our current dataset. Future studies can re-train 
and test the model on a larger population to confirm the ex- 
istence of the answer categories identified within the current 
study. Since the training process of the model is simple, 
re-training the model based on another dataset would be 
straightforward. 
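
To give a sense of how lightweight such re-training can be, the following Python sketch re-fits a clustering model on a new set of feature vectors. The clustering algorithm is not restated in this section, so plain k-means is used here purely as a stand-in. The two-feature rows (number of components, average component size) are made-up illustrative data, and in practice the features would likely need to be standardized first.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Minimal k-means (a stand-in for the clustering step).

    points: list of equal-length numeric tuples, one per answer.
    Returns (centroids, labels).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct answers as starting centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each answer goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical new dataset: (number of components, average component size).
answers = [(1, 1.0), (1, 2.0), (2, 1.5), (1, 44.0), (1, 46.0), (1, 45.0)]
centroids, labels = kmeans(answers, k=2)
print(labels)
```

The first three answers (small mistakes) and the last three (one large mistake) end up in different clusters. Which cluster receives which numeric label depends on the initialization.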


Another next step for this research is to deploy the model 
in an online training platform and conduct user testing to 
examine the effectiveness and accuracy of the categorization 
and intervention. Last but not least, the method proposed
in this study is designed to be flexible and applicable to
other disciplines. Future work in other disciplines, such as
evaluating circuit diagrams in Electrical Engineering, engine
models in Mechanical Engineering, building plans in Architecture,
and structural formulas in Organic Chemistry, will
need to be conducted to evaluate the extent to which the
proposed method generalizes to new topics.


7. CONCLUSION 


In conclusion, this paper presents a clustering model as a 
solution to categorize erroneous answers in complex free- 
hand sketching questions in spatial visualization training. 
Eight abstract features were developed and shown to be effective
in the categorization of erroneous answers, including
percentages of various types of incorrect lines, number of 
mistake components, and metrics of the size of the compo- 
nents. The clustering model detected six answer categories 
based on the severity and scale of the mistakes. With these 
detected categories, an online training platform will be able 
to present customized and actionable formative feedback in 
real-time. Moreover, our approach suggested a new and 
comprehensive set of evaluation criteria to assess a sketch, 
which could potentially be generalized to other disciplines 
that require sketching practices. 